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Abstract 

Family data and rare variants are two key features of whole genome sequencing analysis for hunting the missing 
heritability of common human diseases. Recently, Zhu and Xiong proposed the generalized tests that combine 
rare variant analysis and family data analysis. In similar fashion, we developed the extended tests for longitudinal 
whole genome sequencing data for family-based association studies. The new methods simultaneously incorporate 
three correlation sources: from linkage disequilibrium, from pedigree structure, and from the repeated measures of 
covariates. We assess and compare these methods using the simulated data from Genetic Analysis Workshop 18. 
We show that, in general, the extended tests incorporating longitudinal repeated measures have higher power 
than the single-time-point tests in detecting hypertension-associated genome segments. 



Background 

Compared with traditional genome-wide association stu- 
dies (GWAS), whole genome sequencing (WGS) is a 
more efficient way of finding the missing heritability of 
diseases [1]. While GWAS are mostly based on microar- 
ray genotyping, which can discover only common 
genetic variants, WGS is able to reveal rare and struc- 
tural variants, which are crucial factors behind disease 
phenotypes [2]. As the cost of sequencing decreases sig- 
nificantly, we expect that WGS will become increasingly 
predominant in the hunt for novel disease genes. 

Most of the recent discoveries from sequencing stu- 
dies were based on the Mendelian trait model [3]. 
Genetic association studies based on the complex trait 
model are challenging because of limited sample size as 
well as the new properties of sequencing data. WGS 
data are distinct from GWAS data in two major aspects. 
First, WGS provides a huge number of rare variants. 
Even with large allelic effects, caused by very small 
minor allele frequencies (MAFs), the association tests 
between single rare variants and the trait are less power- 
ful and unreliable [4]. Second, family designs play a cri- 
tical role in WGS. Because of its relatively high cost. 
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WGS tends to exploit families of patients, so that the 
rare causal variants are likely enriched through cotrans- 
mission of the disease [5]. Furthermore, the pedigree 
structure allows statistical imputation of the genotypes 
at no experimental cost, which potentially increases the 
statistical power [6,7]. 

In a recent celebrated work, Zhu and Xiong proposed 
a set of generalized tests for family-based WGS data 
[8]. These methods simultaneously address the correla- 
tions among genetic variants (i.e., linkage disequilibrium 
[LD]) and the correlations among family members (i.e., 
kinship). Rare-variant collapsing procedures [9,10] are 
also integrated into the tests. However, these methods 
cannot incorporate covariates and do not address the 
correlation structure for longitudinal repeated measures. 
In this study, we further extended the methodology of 
the T tests to address these limits. By applying these 
methods to an analysis of the Genetic Analysis Work- 
shop 18 (GAW18) simulation data, we showed that the 
asymptotic null distributions of Zhu and Xiong [8] are 
problematic in controlling the type I error rate, and that 
our extended methods are likely more powerful for 
longitudinal data. 
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Methods 

Generalized 7^ tests for family data 

Zhu and Xiong [8] showed that the covariance as a resuh 
of both LD and kinship could be exphcitly expressed as a 
Kronecker product of the corresponding covariance 
matrices. Following the idea of Hotelling's test [11,12], 
the authors proposed a generalized test that incorpo- 
rates these covariance matrices, which are estimated 
separately by using the same data. Depending on various 
strategies of collapsing of rare variants, here we consider 
three generalized T tests of Zhu and Xiong. 

T : The genotypes of rare variants between adjacent 
common variants are summed up, and one covariance 
matrix is estimated for both common and collapsed rare 
variants. 

CMC.ZXpaper (CMC test): The rare variants are col- 
lapsed in the same way as above, but the covariance 
matrices are estimated separately for common and rare 
variants (assuming they are uncorrelated). 

CMC.ZXcode: Rare variants are collapsed by the max- 
imum of their genotypes, and one covariance matrix is 
estimated for both common and collapsed rare variants. 
This strategy follows the R function pedCMC of Zhu 
and Xiong (https://sph.uth.edu/hgc/faculty/xiong/soft 
ware-D.html). 

Extended 1^ test for family data with longitudinal 
repeated measures 

Building on the idea of Zhu and Xiong, we further extend 
the generalized T^ tests to account for the longitudinal 
repeated covariates. Figure 1 shows the data structure 
and the idea of the extension. Specifically, the extended 
T^ tests compare the blocks of common variants, rare 



variants, and covariates with repeated measures in cases 
and in controls, while simultaneously accounting for the 
correlations among genetic factors, among pedigree indi- 
viduals, and among longitudinal repeated measures. The 
response is the occurrence of the event at any of the 
measurement points. 

Following the notations in Figure 1, let be the num- 
ber of the cases, nd be the number of the controls, and 
n = tic + The genotype column vector of the tth com- 
mon variant is Z' = [Z\, . . . ,ZJ,)', the aligned column 
vector of all T common variants is represented by 
Z = (Z^', . . . , Z^')'. Similarly, for the collapsed genotypes 
of rare variants, the genotype column vector of the sth 
rare variant is V = {V{, V'^)', and V = {V^' , V^')' 
for totally S rare variants. Considering the covariates with 
longitudinal repeated measures, the column vector of the 
cth covariate at the /th repeated measurement point is 
A'' = (Aj*, . . . , Aji')', and the aligned column vector is 

A = (A"' Ai'U'^' A^'' A"' ACi'y for totally 

C covariates, each measured for / times. Similarly, the row 
vectors are denoted as follows. For i = 1, . . . ,n, the vectors 
Zi = [Zj , . . . , Zjy are the rows in the block of common 
variants, the row vectors V, = {Vl, Vf)' are for rare var- 
iants, and Ai = {Al\...,Ay,Af,...,Af,...,Af\...,Af'y 
are for longitudinal covariates. The row average in 
cases is Zc = X^Si ^il^o and that in controls is 
Zd = > Zijud. The row averages for rare variants and 

^—'i=n, + l " 

covariates are obtained analogously. 

The idea of the extended T^ test is simply to compare 
the difference between the row average of the case blocks 
and the row average of the control blocks. Let 
r] = (Z! , V', A')'. The difference between row averages can 
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Figure 1 Data structure for composing tlie extended T tests. Data contain 3 blocks: common variants, rare variants, and longitudinal 
covariate measures. The statistics integrate the correlations among both rows and columns, and test whether there exists a significant difference 
between the row vector mean of the cases and that of the controls. 
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be written in terms of 1. That is 



Hr] 




(1) 



where if we define Dr = (ui, ... , Un)', with m, = 1 for 
cases i= \, ... ,nc and m, = 0 for controls i = Uc + I, . . . ,n, 
and denote 1 as a vector of 1 of length n and I[k) as an 
identity matrix of dimension k, then the matrix 



//(T)® (Dr- -1) 0 



(2) 



Following the idea of the generalized test, 
the extended test is = (H)])T"^ (Hj?) , where 
A = Var (r]) ., A = Var (rj) . 

The key problem is to estimate A. Following the 
assumption of Zhu and Xiong [8] that Z and V are 
independent, we consider two cases. In the first case, 
assume A is also independent with Z and V. Then 
Varir]) = A = diag(Az, Av, Aa), where by Az = Var (Z) = Sz ® * 
and Ay = Var (V) = (g) O. 

Ez and Ey are the covariance matrix among the ele- 
ments in Z, and Vj, respectively (e.g., the LD among the 
genetic variants), O is the kinship matrix, and (8) denotes 
the Kronecker product. For the covariate block, we con- 
sider Aa = Var (A) = Ea <8) 4>*, where Ea is the covar- 
iance matrix among the elements Ai, and cj)* is a matrix 
that captures the correlations among individuals in 
terms of environmental covariates. 

To better account for the heterogeneity of the data in 
cases and in controls, we applied the method in Hotell- 
ing's T^ test for estimating the covariance matrix (which 
is different from equation (6) in Ref [8]). Then equation 
(3) is simplified to 



(d,-5:i)'»(d,-5:i) 



(d,-Ji)'«.(d,-^i) 



(3) 



We consider two simplification assumptions: (a) 
<t>* = I indicates that covariate variables among indivi- 
duals are independent, considering the individual depen- 
dence has been captured by the genetics; and (b) = $ 
indicates that covariate variables among individuals have 
the similar dependence pattern as that according to 
genetics (e.g., children may be more likely to smoke if 
parents do, or the age of children is correlated with the 
age of parents). According to the various rare-variant 
collapsing strategies in the above generalized T^ tests by 
Zhu and Xiong [8], the corresponding extended T^ tests 
are denoted T2.1ongi, CMC.ZXpaper.longi, and CMC. 
ZXcode.longi, respectively. 



Asymptotic and permutation tests 

Zhu and Xiong derived asymptotic chi-square distribu- 
tion for the null. In their paper [8], the degrees of free- 
dom (DF) equal the number of variants; in their R code, 
the DF equal the rank of data matrix. The latter is bet- 
ter but still gives inflated p values as shown below. 
Thus, we applied a permutation test for the type I error 
rate being well controlled. Specifically, let r| and T^p 
I = I, ... ,L, 1= 1, . . . , L, denote the test statistics of the 
^h genome window from the original data and from the 
Zth permutation, respectively. The empirical p value 

for the ^th window is = # jr^ > T^, / = 1, . . . , l} /L, 

where L =1000. Because the target is to find the associa- 
tions with genetic variants, not with the covariates, the 
permutations are applied only to the genotype data to 
retain the relationship between response and the 
covariates. 

Results 

For evaluating the above methods, we used the "dose" 
genotype data of 1,215,399 single-nucleotide variants 
(SNVs) on chromosome 3 and the 200 simulation repli- 
cates of hypertension outcomes and covariates (age, 
hypertension medicine use, smoking status). As an arbi- 
trary, yet simple, way to group variants, we split chr3 
into 19,080 windows, each 10 kilobase pairs (kbp) long. 
In each window, rare variants (MAF <0.05) between 
adjacent common variants were collapsed, leaving 
654,415 genetic factors (common or rare variants, or 
collapsed rare-variant groups) to be analyzed. The aver- 
age number of genetic factors contained in the windows 
is 34.3, the median is 32, the minimum is 1, and the 
maximum is 330. For the simulated phenotypes, the 
number of individuals is 849 in 20 families, where the 
family sizes are from 21 to 74, with the mean 42.45 and 
the median 36.5. There are 188 simulated true SNVs 
contained in 129 true windows (1, 3, 7, 32, and 86 windows 
contain 5, 4, 3, 2, and 1 true SNVs, respectively) on chr3. 
The knowledge of these true SNVs was used only for eval- 
uating the power of these association tests, not for design- 
ing data analysis strategy. 

To assess the asymptotic null distributions of the tests 
provided in Zhu and Xiong [8], we obtained the asymptotic 
p values of these tests for all false windows in chr3. The 
Q-Q plot of Figure 2 shows that all three methods have 
inflated p values with large genomic inflation factors X [13]. 
For example, when one chooses a p value cutoff of 0.05, 
the actual (empirical) error rate is approximately 0.1. At 
the same time, the following results show that permutation 
test controls the type I error rate well. Thus, the inflated 
type I error rate is likely caused by the inappropriate 
asymptotic null distributions, not by possible stratification. 



Liu et al. BMC Proceedings 2014, 8(Suppl 1):S40 
http://www.biomedcentral.eom/1753-6561/8/S1/S40 



Page 4 of 6 



/ N 

QQ-plot based on chr3 false windows 




0 1 2 3 4 5 6 



Expected 

Figure 2 Q-Q plot for asymptotic p values of Zhu and Xiong's 
generalized tests. Results are based on the 18,951 false 10 kbp-long 
windows on chr3. 



We studied the power of these tests in detecting true 
windows over chr3. Based on the phenotype data in the 
simulation replicate 1, the right panel of Figure 3 shows 
the receiver operating characteristic (ROC) curve for 



power (estimated by the true positive rate) over a variety 
of p value cutoffs. In general, the power is low at small or 
moderate p values. This phenomenon indicates that the 
sample size is still relatively too small for detecting many 
weak genetic effects simulated in the data. At the same 
time, it is clear that the 3 extended T tests that incorpo- 
rate longitudinal information are significantly better than 
the generalized T tests that only consider the measures at 
the first time point. Because the two setups: = / and 
(t>* = (J) in (3) led to similar results, we only report that by 
O* = / for simplicity. The left panel of Figure 3 shows that 
the permutation test controls the type I error rates well 
for all methods. 

To compare the overall capabilities of these tests, we 
studied their power (i.e., true positive rates) in detecting 
each of all windows over 200 simulation replicates. As illu- 
strated in Figure 4, there are 4 representative patterns of 
the comparisons for the 129 true windows on chr3. In par- 
ticular, 93 windows have longitudinal extended T^ tests 
more powerful than generalized T^ tests (illustrated in Fig- 
ure 4, left panel), 5 windows have similar results for both 
(Figure 4, middle panel), 15 windows have generalized T^ 
tests more powerful (Figure 4, right panel), and the 
remaining 16 windows have almost no power for any tests. 
So, the longitudinal extended T^ tests are significantly 
superior to the single-time-point generalized T^ tests (93 
vs. 15, p value = 3.8e-15 based on binomial model). For all 
windows, the type I error rates of all methods were well 
controlled (results are available upon request). 



Type I error rates based on 129 true windows 



Power Curve based on 129 true windows 
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Figure 3 Type I error rate and power of detecting true windows on chr3. The power Is the percentage of all 129 true windows that are 
detected at various p value cutoffs. The type I error rate Is the same percentage of these windows, except the genotypes were permuted to 
destroy the genetic associations. 
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Figure 4 Three patterns of comparisons between the longitudinal tests and one-time-point tests. Left: longitudinal tests are better 
(illustrated by window 4703, 79 windows in total); middle: they are similar (illustrated by window 1330, 19 windows in total); right: one-time- 
point tests are better (illustrated by window 13575, 5 windows in total). 



Discussion 

In the simulation data of GAW18, true SNVs are always 
allocated on genes. Using genes as windows to group 
SNVs may concentrate the true SNVs and has the 
potential to improve the detection power. However, the 
idea of WGS, instead of exome sequencing, is that the 
disease-related genetic factors might allocate outside of 
genes. So we did not use the knowledge that true SNVs 
are in genes; instead, we evaluated the methods based 
on fixed-genome segment windows. 

There are several limitations and future research topics 
based on the current work. First, matrix estimation is a 
key issue in this methodology development. Good estima- 
tion of matrices and their inverses can better incorporate 
correlation structures' potential to improve the perfor- 
mance. Here we simply adopted the same variance matrix 
estimate in Hotelling's T test. This is a maximal likeli- 
hood estimate if observations are independent. Unfortu- 
nately, independency is not true for family data in the first 
place. Besides the correlation issue, for a high-dimensional 
matrix with a potentially sparse structure, there are better 
estimates of the covariance matrix and its inverse [14]. 
Second, the permutation test is relatively slow, especially 
for handling large amounts of data in WGS. It would be 
desirable to derive more accurate asymptotic distributions 
for fast and precise p value calculation. Third, necessary 
modifications of these tests are needed to handle missing 
data and unequal numbers of repeated measures, which 
are common problems. 

Conclusions 

We have extended Zhu and Xiong's [8] generalized T^ 
tests to incorporate the covariates with longitudinal 
repeated measures. These methods account for 3 sources 
of correlation structures among genetic variants, family 



members, and time series observations. Compared with 
the T test methods for snapshot observations, the new 
methods have higher power to detect hypertension- 
related genome segments according to the GAWIS simu- 
lation data. 
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